Abstract: This paper describes our participation in the TREC
Legal  competition  in  2008.  Our  first  set  of  experiments  involved 
the  use  of  Latent  Semantic  Indexing  (LSI)  with  a  small  number 
of  dimensions,  a  technique  we  refer  to  as  Essential  Dimensions 
of  Latent  Semantic  Indexing  (EDLSI).  Because  the  experimental 
dataset  is  large,  we  designed  a  distributed  version  of  EDLSI  to 
use for our submitted runs. We submitted two runs using distributed
EDLSI, one with k = 10 and another with k = 41, where
k  is  the  dimensionality  reduction  parameter  for  LSI.  We  also 
submitted  a  traditional  vector  space  baseline  for  comparison  with 
the  EDLSI  results.  This  article  describes  our  experimental  design 
and  the  results  of  these  experiments.  We  find  that  EDLSI  clearly 
outperforms  traditional  vector  space  retrieval  using  a  variety  of 
TREC  reporting  metrics. 

We also describe experiments that were designed as a followup
to our TREC Legal 2007 submission. These experiments
test weighting and normalization schemes as well as techniques
for relevance feedback. Our primary intent was to compare the
BM25 weighting scheme to our power normalization technique.
BM25 outperformed all of our other submissions on the competition
metric (F1 at K) for both the ad hoc and relevance feedback
tasks, but power normalization outperformed BM25 in our ad hoc
experiments when the 2007 metric (estimated recall at B) was
used for comparison.


1  Introduction 

In  the  2007  TREC  Legal  competition,  our  submissions 
were  designed  to  test  the  effectiveness  of  a  variety  of 
simple  normalization  schemes  on  a  large  dataset.  We 
had  planned  a  comparison  with  some  of  the  state-of-the- 
art  systems  for  weighting  and  normalization,  but  these 
activities  were  not  completed  in  time  for  the  TREC 
2007  submissions.  Thus,  in  2008  one  of  our  primary 
goals was to compare power normalization [9] to the
BM25  weighting  scheme  ED.  We  submitted  five  ad  hoc 
runs  and  six  relevance  feedback  runs  that  compare  these 
algorithms. 


We  have  extensive  experience  with  Latent  Semantic 
Indexing  (LSI)  0,  Q,  0,  ED,  Ell,  ED-  Thus  we 
were  also  eager  to  see  how  LSI  would  work  on  the  IIT 
Complex  Document  Information  Processing  (IIT  CDIP) 
test  collection,  which  contains  approximately  7  million 
documents  (57  GB  of  uncompressed  text).  Specifically, 
we  wanted  to  see  if  the  Essential  Dimensions  of  Latent 
Semantic  Indexing  (EDLSI)  0  approach  would  scale  to 
this  large  collection  and  what  the  optimal  k  value  would 
be.  We  have  used  SVD  updating,  folding-in,  and  folding- 
up  in  previous  work  E3,  EH,  ED,  and  it  appeared 
that  these  techniques  would  be  useful  for  handling  a 
collection  of  this  size. 

This  year,  teams  participating  in  the  TREC  Legal  task 
were required to indicate the K and Kh values for each
query. K is the threshold at which the system believes
the competing demands of recall and precision are best
balanced (using the F1@K measure shown in Equation
[1]), and Kh is the corresponding threshold for highly
relevant  documents.  Assessors  assigned  a  value  of  highly 
relevant  when  judging  documents  for  the  first  time  this 
year.  Much  of  our  preparatory  work  for  the  competition 
centered  on  determining  appropriate  ways  to  set  K  and 
Kh  for  each  query. 

$$ F1@K = \frac{2 \cdot P@K \cdot R@K}{P@K + R@K} \tag{1} $$
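
As a small illustration of Equation [1], the sketch below computes precision, recall, and F1 at a cutoff K from a ranked list and a set of relevant documents. The function name `f1_at_k` and the toy identifiers are hypothetical, and this is the plain (unestimated) form of the measure; the track's official figures are sample-based estimates, which this sketch does not attempt to reproduce.

```python
def f1_at_k(ranked_doc_ids, relevant_ids, k):
    """Return (precision, recall, F1) over the top-k documents of a ranking."""
    retrieved = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in retrieved if doc_id in relevant_ids)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy example: two of the three relevant documents appear in the ranking.
ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d9"}
p, r, f1 = f1_at_k(ranking, relevant, k=2)
```

Raising K can only keep or increase recall, while precision usually falls, which is why a per-query choice of K matters for this measure.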

This paper is organized as follows: Sections [2], [3], and
[4] describe the methodologies used. Section [5] discusses
our approach for finding the optimal K and Kh values
for each query. Section [6] describes our experiments and
results when EDLSI was used in the ad hoc task. Section
[7] details our power normalization and BM25 ad hoc
experiments and results. Section [8] discusses our relevance
feedback submissions with power normalization
and BM25. Section [9] presents our conclusions.

[Fig. 1. Panel labels: Truncated Term by Dimension; Term by Document]

2  Essential  Dimensions  of  LSI 

In  this  section  we  first  describe  LSI.  We  then  discuss 
EDLSI  and  how  it  improves  upon  LSI.  We  also  detail 
the  distributed  EDLSI  approach  we  used  for  our  TREC 
Legal  2008  submissions. 

2.1  Latent  Semantic  Indexing 

LSI  is  a  well-known  approach  to  information  retrieval 
that  is  believed  to  reduce  the  problems  associated  with 
polysemy  and  synonymy  0.  At  the  heart  of  LSI  is  a 
matrix  factorization,  the  singular  value  decomposition 
(SVD),  which  is  used  to  express  the  t  x  d  term-by- 
document  matrix  as  a  product  of  three  matrices,  a  term 
component (T), a “diagonal” matrix of nonnegative singular
values in non-increasing order (S), and a document
component (D). This is shown pictorially in Figure [1],
taken from 0. The original term-by-document matrix
A is then given by A = TSD^T. In LSI, these matrices
are truncated to k ≪ min(t, d) dimensions by zeroing
elements in the singular value matrix S. In practice, fast
algorithms  exist  to  compute  only  the  required  k  dominant 
singular  values  and  the  associated  vectors  in  T  and  D  JT|. 
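
The truncation step can be sketched with NumPy as follows. This is a toy illustration on a small dense matrix, not the system used for the runs: the helper name `truncated_svd` and the example matrix are hypothetical, and for a large sparse collection one would instead use a solver that computes only the k dominant singular triplets (for example `scipy.sparse.linalg.svds`).

```python
import numpy as np


def truncated_svd(A, k):
    """Return T_k, the k largest singular values, and D_k for A ~ T_k S_k D_k^T."""
    # full_matrices=False gives the thin SVD; singular values come back
    # sorted in non-increasing order.
    T, s, Dt = np.linalg.svd(A, full_matrices=False)
    return T[:, :k], s[:k], Dt[:k, :].T


# Toy 4-term by 3-document matrix.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

Tk, sk, Dk = truncated_svd(A, k=2)
Ak = Tk @ np.diag(sk) @ Dk.T   # rank-2 approximation of A
```

Here `Ak` has the same shape as `A` but rank 2; only `Tk`, `sk`, and `Dk` need to be stored.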

The  primary  problem  with  LSI  is  that  the  optimal 
number of dimensions k to use for truncation is dependent
upon the collection [4], [10]. Furthermore, as shown
in  0,  LSI  does  not  always  outperform  traditional  vector 
space  retrieval,  especially  on  large  collections. 

2.2  EDLSI 

To  address  this  problem,  Kontostathis  developed  an 
approach  to  using  LSI  in  combination  with  traditional 
vector  space  retrieval  called  EDLSI.  EDLSI  uses  a 
convex  combination  of  the  resultant  vector  from  LSI 
and  the  resultant  vector  from  traditional  vector  space 
retrieval.  Early  results  showed  that  this  fusion  approach 
provided  retrieval  performance  that  was  better  than  either 
LSI or vector space retrieval alone 0. Moreover, this
approach is not as sensitive to the k value, and in fact, it
performs well when k is small (10-20). In addition to
providing better retrieval performance, keeping k small
provides significant runtime benefits. Fewer SVD dimensions
need to be computed, and memory requirements
are reduced because fewer columns of the dense D and
T matrices must be stored.

The computation of the resultant vector w using
EDLSI is shown in Equation [2], where x is a weighting
factor (0 < x < 1) and k is kept small. In this
equation,  A  is  the  original  term-by-document  matrix, 
Ak  is  the  term-by-document  matrix  after  truncation  to 
k  dimensions,  and  q  is  the  query  vector. 

$$ w = x\,(q^T A_k) + (1 - x)\,(q^T A) \tag{2} $$
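
A minimal NumPy sketch of Equation [2] follows. The function name `edlsi_scores` and the toy data are hypothetical; the default x = 0.2 is the weighting factor reported later in the paper. The LSI term is computed from the truncated factors rather than by forming the dense matrix Ak, which is an implementation choice assumed here, not something the text specifies.

```python
import numpy as np


def edlsi_scores(A, q, k, x=0.2):
    """Score every document: x * (q^T A_k) + (1 - x) * (q^T A)."""
    T, s, Dt = np.linalg.svd(A, full_matrices=False)
    # q^T A_k = ((q^T T_k) S_k) D_k^T, evaluated without building A_k.
    lsi_part = ((q @ T[:, :k]) * s[:k]) @ Dt[:k, :]
    vector_part = q @ A
    return x * lsi_part + (1.0 - x) * vector_part


# Toy 4-term by 3-document matrix and a query using terms 0 and 2.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
q = np.array([1.0, 0.0, 1.0, 0.0])

w = edlsi_scores(A, q, k=1)   # one score per document
```

Two sanity checks follow from the formula: with x = 0 the result is plain vector space retrieval, and with k equal to the full rank of A the LSI term equals the vector space term, so the blend changes nothing.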


2.3  Distributed  EDLSI 

Although  the  runtime  requirements,  specifically  the 
memory,  for  EDLSI  are  reduced  over  LSI,  they  are  still 
significant,  especially  for  a  corpus  the  size  of  IIT  CDIP. 
After  indexing,  we  produced  a  term-by-document  matrix 
that contained 486,654 unique terms and 6,827,940 documents.
Our indexing system used log-entropy weighting
and OCR error detection. Details can be found in 0.

In order to process this large collection, we decided
to sub-divide it into 81 pieces, each of which was about
300  MB  in  size.  Each  piece  contained  approximately 
85,000  documents  and  approximately  100,000  unique 
terms  (although  the  term-by-document  matrix  for  each 
piece  was  approximately  85K  by  487K).  Each  piece  was 
treated as a separate collection (similar to [14]), and the
2008  queries  were  run  against  each  piece  using  EDLSI. 
The  result  vectors  were  then  combined  so  that  the  top 
100,000  scoring  documents  overall  could  be  submitted 
to  TREC.  The  architecture  model  appears  in  Figure  [2] 
In  Section  [6]  we  discuss  our  2008  TREC  submissions 
using  distributed  EDLSI. 
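
The combination step can be sketched as a top-N merge over the per-piece result lists. The function name `merge_partition_results` and the toy scores are hypothetical, and the sketch assumes that raw scores from different pieces are compared directly with no rescaling, which is how we read the description above.

```python
import heapq


def merge_partition_results(partition_results, top_n=100_000):
    """Merge per-piece (doc_id, score) lists into one global top-N list."""
    all_scored = (pair for part in partition_results for pair in part)
    # nlargest returns the pairs sorted from highest to lowest score.
    return heapq.nlargest(top_n, all_scored, key=lambda pair: pair[1])


# Toy example with three pieces.
pieces = [
    [("a", 0.9), ("b", 0.1)],
    [("c", 0.5)],
    [("d", 0.7)],
]
top3 = merge_partition_results(pieces, top_n=3)
```

Because each piece is scored independently, the per-piece work can run in parallel and only the short result lists need to be brought together.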
Fig. 2. Distributed EDLSI


3  Power  Norm 

A  variety  of  well-known  and  effective  term  weighting 
schemes  can  be  used  when  forming  the  document  and 
query  vectors  03  .  Term  weighting  can  be  split  into  three 
types: a local weight based on the frequency within the
document,  a  global  weight  based  on  a  term’s  frequency 
throughout  the  dataset,  and  a  normalizing  weight  that 
negates  the  discrepancies  of  varying  document  lengths. 
The  entries  in  the  document  vector  are  computed  by 
multiplying  the  global  weight  for  each  term  by  the  local 
weight  for  the  document-term  pair.  Normalization  may 
then  be  applied  to  the  document  and  query  vectors. 

Our power normalization studies employed log-entropy
weighting. The purpose of the log-entropy
weighting scheme is to reduce the relative importance of
high-frequency terms while giving words that distinguish
the documents in a collection a higher weight. Once the
term weight is computed for each document, we then
normalize the document vectors using Equation [3]. In
this equation, dtw is the document term weight, qtw is
the query term weight, dc is the number of terms in the
document, qc is the number of terms in the query, p
is the normalization parameter, and the sum is over all
terms in the query.
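
The sketch below shows one scoring function consistent with the symbols just defined. It is an assumption, not a transcription of Equation [3]: it divides each document term weight by dc^p and each query term weight by qc^p and sums the products over the query terms. The function name `power_norm_score`, the use of distinct-term counts for dc and qc, and the toy weights are all hypothetical.

```python
def power_norm_score(doc_weights, query_weights, p=1 / 3):
    """Sum over query terms of (dtw / dc**p) * (qtw / qc**p)."""
    dc = len(doc_weights)      # assumed: number of distinct terms in the document
    qc = len(query_weights)    # assumed: number of distinct terms in the query
    if dc == 0 or qc == 0:
        return 0.0
    return sum(
        (doc_weights[term] / dc ** p) * (qtw / qc ** p)
        for term, qtw in query_weights.items()
        if term in doc_weights
    )


# Toy document with 8 distinct terms; only "tobacco" matches the query.
doc = {f"t{i}": 1.0 for i in range(7)}
doc["tobacco"] = 2.0
query = {"tobacco": 1.0}

score = power_norm_score(doc, query, p=1 / 3)   # 8 ** (1/3) = 2, so about 1.0
```

Under this form, p = 0 leaves the dot product unnormalized, and larger p penalizes long documents more strongly.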


Power normalization is designed to reduce, but not
completely eliminate, the advantage that longer documents
have within a collection. Without normalization,
long documents would dominate the top ranks because
they contain so many terms. In contrast, cosine normalization
ensures that all documents have exactly the
same weight for retrieval purposes. Experiments from
the TREC Legal competition in 2007 showed that power
normalization using p = 1/3 and p = 1/4 results in
retrieval performance improvements over cosine normalization
for the 2006 and 2007 query sets [9].
4 BM25

The BM25 algorithm was introduced at TREC 3 [13].
It  combines  an  inverse  collection  frequency  weight  with 
collection  specific  scaling  for  documents  and  queries. 
The weight function for a given document (d) and query
(Q) appears in Equation [4]. In this equation, idf_t refers
to the inverse document frequency weight for a given
term (Equation [5]), K is given by Equation [6], tf is the
number of times term t appears in document d, qtf is the
number of times term t appears in the query Q, N is the
number of documents in the collection, n is the number
of documents containing t, dl is the document length (we
measured this in words), adl is the average document
length for the corpus (also measured in words), and b,
k1, and k3 are tuning parameters.


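$$ w_d = \sum_{t \in Q} idf_t \cdot \frac{(k_1 + 1)\,tf}{K + tf} \cdot \frac{(k_3 + 1)\,qtf}{k_3 + qtf} \tag{4} $$

$$ idf_t = \log\frac{N - n + 0.5}{n + 0.5} \tag{5} $$

$$ K = k_1\left((1 - b) + b \cdot \frac{dl}{adl}\right) \tag{6} $$
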
The  full  BM25  weighting  scheme  is  slightly  more 
complicated  than  this  and  requires  two  additional  tuning 
parameters,  but  BM25  reduces  to  the  above  formula 
when  relevance  feedback  is  not  used  and  when  those  two 
parameters  take  their  standard  values.  Interested  readers 
will  find  details  in  m. 
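
The simplified BM25 weight described in this section can be sketched as below. The function name `bm25_score` and the toy statistics are hypothetical. The idf form log((N - n + 0.5) / (n + 0.5)) is the standard form that the full BM25 relevance weight reduces to without relevance information, and it is assumed here; the default values of b, k1, and k3 are those reported for the baseline run in Section [7].

```python
import math


def bm25_score(doc_tf, query_tf, doc_freq, n_docs, dl, adl,
               b=0.6, k1=2.25, k3=3.0):
    """Simplified BM25 score of one document for one query."""
    K = k1 * ((1.0 - b) + b * dl / adl)          # length-scaled k1
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        n = doc_freq.get(term, 0)
        if tf == 0 or n == 0:
            continue                             # term contributes nothing
        idf = math.log((n_docs - n + 0.5) / (n + 0.5))
        doc_part = (k1 + 1.0) * tf / (K + tf)
        query_part = (k3 + 1.0) * qtf / (k3 + qtf)
        score += idf * doc_part * query_part
    return score


# Toy example: a document of average length that contains the query term once.
score = bm25_score(
    doc_tf={"tobacco": 1},
    query_tf={"tobacco": 1},
    doc_freq={"tobacco": 10},
    n_docs=100,
    dl=435,
    adl=435,
)
```

With dl equal to adl, K reduces to k1, and for tf = qtf = 1 both saturation factors equal 1, so the score is just the idf of the term.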

We recently compared BM25 to power normalization
and determined that BM25 outperforms power normalization
at top ranks, but power normalization outperforms
BM25 in terms of recall after rank 300 E). Our
submissions using BM25 and power normalization for
TREC Legal 2008 are described in Sections [7] and [8].


5 Defining K and Kh

We ran a variety of experiments to determine how to
identify the optimal K and Kh for each query. Our
approach centered on finding a scoring threshold to use
for determining optimal K and Kh values. After several
false starts using the judged documents from 2006 and
2007, we determined that the set of judged documents
alone was too small for training. Thus, we used the
entire collection, in conjunction with the 2006 and 2007
queries and judgment data, for training purposes.

Initially we ran the 2006 and 2007 queries against
the IIT CDIP corpus and developed a pseudo submission
file containing the top 100,000 scoring documents
and their scores for each query. We then used these
files as input and measured F1 at a variety of cutoff
threshold levels. If a document score was greater than
the threshold ThresK, the document was considered
‘relevant’; otherwise it was considered ‘not relevant’. We
then measured precision, recall, and F1 for each query
and computed the average across all queries. We used
the TREC 2006 and TREC 2007 relevance judgment files
for computing precision, recall, and F1. All unjudged
documents were considered not relevant. Again, using
only judged documents did not provide sufficient training
data, in our opinion. This approach provided data that
were more consistent across runs.

Two methods were used for determining the optimal
ThresK. The first method determined the optimal F1
for each query individually by incrementing K from 1
to 100,000 and measuring F1. The document score at the
optimal K was determined to be ThresK for a given
query. The average of the per-query ThresK values was
used as the final ThresK for a query set.

The second approach involved iterating through
thresholds to identify which threshold gave us the optimal
F1 for the entire query set. For power normalization,
thresholds from 3 to .01 were tried (in increments of
.02); for BM25, thresholds of 400 to 25 were tried
(in increments of 5). The maximum F1 determined the
optimal ThresK value. We measured F1 to 3 significant
digits. When ties for the best F1 were found, maximum
recall was used as a tie breaker to determine the optimal
ThresK.

Both techniques for determining ThresK were used
on the queries for 2006 and 2007 separately and then for
2006 and 2007 combined, for both power normalization
and BM25. The optimal threshold values appear in
Table [1]. Because two different approaches were used,
they always produced different optimal values (although
sometimes only slightly different). The lower of these
was used as the K threshold, and the higher was used
as the Kh threshold.
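
The two ways of choosing a score threshold can be sketched as follows. The function names, the toy runs, and the toy judgments are hypothetical. Rounding the mean F1 to three decimal places stands in for "3 significant digits", and unjudged documents count as not relevant simply because they are absent from the relevant sets.

```python
def method1_threshold(runs, qrels):
    """Average, over queries, of the score found at each query's F1-optimal K."""
    optimal_scores = []
    for qid, ranked in runs.items():      # ranked: (doc_id, score), best first
        relevant = qrels.get(qid, set())
        hits, best_f1, best_score = 0, -1.0, None
        for k, (doc_id, score) in enumerate(ranked, start=1):
            hits += doc_id in relevant
            f1 = 2 * hits / (k + len(relevant))   # algebraically 2PR / (P + R)
            if f1 > best_f1:
                best_f1, best_score = f1, score
        if best_score is not None:
            optimal_scores.append(best_score)
    return sum(optimal_scores) / len(optimal_scores)


def method2_threshold(runs, qrels, candidates):
    """Threshold with the best mean F1 over the query set; ties go to recall."""
    best_key, best_thr = None, None
    for thr in candidates:
        f1s, recalls = [], []
        for qid, ranked in runs.items():
            relevant = qrels.get(qid, set())
            retrieved = [d for d, s in ranked if s > thr]
            hits = sum(d in relevant for d in retrieved)
            denom = len(retrieved) + len(relevant)
            f1s.append(2 * hits / denom if denom else 0.0)
            recalls.append(hits / len(relevant) if relevant else 0.0)
        key = (round(sum(f1s) / len(f1s), 3), sum(recalls) / len(recalls))
        if best_key is None or key > best_key:
            best_key, best_thr = key, thr
    return best_thr


runs = {
    "q1": [("a", 3.0), ("b", 2.0), ("c", 1.0)],
    "q2": [("d", 5.0), ("e", 4.0), ("f", 1.0)],
}
qrels = {"q1": {"a", "b"}, "q2": {"d"}}

thr1 = method1_threshold(runs, qrels)
thr2 = method2_threshold(runs, qrels, candidates=[4.5, 1.5, 0.5])
```

On this toy data the two methods disagree (3.5 versus 1.5): the first averages per-query optima, while the second optimizes one value for the whole set.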


6 Ad Hoc Experiments with Distributed EDLSI

In this section we discuss our distributed EDLSI experiments
for the TREC 2008 competition.

6.1  Experimental  Design 

Our main focus during these experiments was the development
and implementation of the system. Initially we
hoped  to  treat  the  entire  collection  as  a  single  LSI  space. 
To  accomplish  this,  we  planned  to  implement  EDLSI 
with  the  folding-up  algorithm  ED  .  Folding-up  combines 
folding-in  |3),  (2|  with  SVD  updating  U2l.  lfT8l.  |f2| 
to maintain the integrity of the LSI space (with simple
folding-in, the columns of T and D generally become
less orthogonal with every added term and document,
respectively). We initially sub-divided the collection into
81 pieces and used traditional SVD on the first piece.
We then attempted to integrate the remaining 80 pieces
via the folding-up process. Unfortunately, the folding-up
process is memory intensive, and we estimated that the
process would run approximately 35 days; we did not
have sufficient time for it to complete.

TABLE 1
Determining Optimal K and Kh

Weighting            Query                 Method 1                Method 2
Scheme               Set                   ThresK  F1    recall    ThresK  F1    recall
BM25                 TREC 2006             162.4   .104  .177      190     .044  .123
BM25                 TREC 2007             307.9   .130  .164      280     .054  .234
BM25                 Combined 2006/2007    238.7   .117  .170      215     .040  .236
Power normalization  TREC 2006             .655    .050  .187      .800    .026  .075
Power normalization  TREC 2007             1.997   .135  .171      1.720   .061  .259
Power normalization  Combined 2006/2007    1.358   .095  .179      1.700   .032  .139

Our  next  approach  combined  distributed  EDLSI  with 
folding-up.  Our  design  included  distributing  EDLSI 
across  8  processors,  each  processing  about  10  chunks 
of data. The SVD would be performed on the first chunk, and
the remaining 9 would be folded-up. We estimated that this
process  would  take  about  10  days,  and  once  again  we 
did  not  have  sufficient  time. 

Finally  we  decided  to  run  a  fully  distributed  EDLSI 
system  with  no  folding-up.  In  this  system,  we  performed 
the  SVD  on  each  of  the  81  pieces  separately,  used  LSI 
to  run  the  queries  against  the  SVD  space,  and  combined 
the  LSI  results  with  the  traditional  vector  space  retrieval 
results  (EDLSI).  Interestingly,  we  were  able  to  run  this 
algorithm  on  an  IBM  x3755  8-way  with  4  AMD  Opteron 
8222  SE  3.0GHz  dual-core  processors  and  128GB  RAM 
running  64-bit  RedHat  Linux  in  approximately  4  hours. 

We  used  only  the  request  text  portion  of  each  query 
for  all  runs. 

6.2  Choosing  the  k  Parameter 

For  any  LSI  system,  the  question  of  which  k  to  use 
must  be  addressed.  To  consider  this  question,  we  tested 
our  distributed  EDLSI  with  folding-up  system  on  a 
subset  of  the  IIT  CDIP  corpus  using  the  TREC  2007 
queries.  In  this  training  run,  we  extracted  all  documents 
that  were  judged  for  any  of  the  2007  queries  from  the 
full IIT CDIP collection. This collection contained approximately
22,000 documents and 85,000 terms. After
preliminary  testing  to  ensure  that  the  results  for  TREC 
Legal  2007  were  similar  to  other  collections,  we  tested 
values  for  k  from  15  to  49  (in  increments  of  2)  and 
values  for  x  from  .16  to  .22  (in  increments  of  .02).  The 
optimal  values  were  k  =  41  and  x  =  .2.  These  values  are 
consistent  with  the  values  identified  in  the  early  EDLSI 
experiments  described  in  0. 


The  distributed  EDLSI  system  (without  folding-up) 
with  k  =  41  and  x  =  .2  was  run  against  the  2008  queries 
and  was  submitted  for  judging  as  UCEDLSIa. 

One of the most interesting things about EDLSI is that
it works with very few dimensions of the SVD space; thus
we decided to submit a second run with an even lower k
value. We expected a lower k value to provide improved
results because keeping k constant while increasing the
number of divisions (over the space we used for the
training run) resulted in a greater number of values being
used in the Tk, Sk, and Dk matrices. For instance, at
k = 41, there are a total of 4.4 x 10^8 numbers being
stored in Tk, Sk, and Dk for 8 partitions, but there are
18.8 x 10^8 numbers being stored in Tk, Sk, and Dk for
80 partitions.

However, with 80 partitions and k = 10, there are
4.6 x 10^8 numbers in Tk, Sk, and Dk. We speculate
that  this  may  indicate  that  roughly  the  same  amount 
of  noise  has  been  eliminated  from  the  system  as  with 
8  partitions  and  k  =  41.  Therefore  we  decided  to  use 
k  =  10  and  x  =  .2  for  our  second  submission  run  for 
2008  (UCEDLSIb).  All  experiments  to  date  with  EDLSI 
seem  to  indicate  that  .2  is  a  suitable  weighting  factor,  and 
0  shows  that  performance  does  not  vary  much  with  x, 
even  on  fairly  large  collections. 
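
The storage figures quoted above can be reproduced with a short calculation. The accounting is our reading rather than something the text spells out: it assumes every partition keeps a Tk with one row per term in the full 486,654-term vocabulary, k singular values per partition, and one Dk row per document, with each of the 6,827,940 documents belonging to exactly one partition. The function name `stored_values` is hypothetical.

```python
def stored_values(n_terms, n_docs, n_partitions, k):
    """Count the numbers held in T_k, S_k, and D_k across all partitions."""
    t_values = n_partitions * n_terms * k   # one (terms x k) matrix per partition
    s_values = n_partitions * k             # k singular values per partition
    d_values = n_docs * k                   # each document sits in one D_k
    return t_values + s_values + d_values


N_TERMS, N_DOCS = 486_654, 6_827_940

for partitions, k in [(8, 41), (80, 41), (80, 10)]:
    total = stored_values(N_TERMS, N_DOCS, partitions, k)
    print(f"{partitions:>2} partitions, k = {k:>2}: {total / 1e8:.1f} x 10^8")
```

Under these assumptions the totals come to about 4.4 x 10^8, 18.8 x 10^8, and 4.6 x 10^8. The term matrices dominate once the number of partitions grows, which is why lowering k to 10 brings the 80-partition total back near the 8-partition figure.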

6.3  Baseline  Run 

Because  comparison  to  a  complete  LSI  run  was  not 
feasible  for  a  collection  this  size,  we  chose  instead  to 
compare  our  distributed  EDLSI  system  to  the  traditional 
vector  space  system  that  forms  a  part  of  EDLSI.  Our 
hypothesis is that EDLSI will elicit enough term relationship
information from LSI to boost the performance
of  the  vector  space  model.  Our  vector  space  baseline  was 
submitted  as  UrsinusVa. 

6.4  Defining  K  and  Kh  for  EDLSI 

For UCEDLSIa, the K and Kh thresholds were set using
the TREC 2007 training run via the process described
in Section [5]. This was not ideal, as explained above,
because only the judged documents were included in the
EDLSI 2007 training run, not the full collection. The run
was optimized using threshold values from .01 to 1.00
in steps of .01. This yielded the values ThresK = .207
and ThresKh = .227. These same values were used for
the vector baseline run UrsinusVa.

Fig. 3. Distributed EDLSI Result Summary

Initially we planned to use these values for the UCEDLSIb
run also, but a glance at the output indicated that
the thresholds were too low to yield meaningful results.
Almost all queries for UCEDLSIb had K and Kh values
close to 100,000. As a result, we raised these thresholds
to ThresK = .300 and ThresKh = .350 for UCEDLSIb
because they seemed to produce more reasonable K and
Kh values. We should have trusted the original data,
however (the top scoring runs at TREC Legal all had
average K values at or near 100,000). A follow-up run,
which set K = 100,000 for all queries, resulted in an
estimated F1 at K that was 11 times larger than the
submitted run (.1168 vs .0094); this would have made
UCEDLSIb our top scoring run.

6.5  Results 

Figure [3] shows a comparison of the two EDLSI submissions
and the vector baseline using a variety of metrics.
EDLSI clearly outperforms vector space retrieval across
the board. Furthermore, the performance of EDLSIa,
the optimized EDLSI run, is not dramatically better
than the performance of EDLSIb, the baseline EDLSI
run (k = 10), although it is slightly better on most
metrics. The exception is estimated precision at F1,
where EDLSIb outperforms EDLSIa. This is a direct
result of the average K value submitted for the two runs.
Average K for EDLSIa was 15,225; for EDLSIb it was
only 1,535. Interestingly, the average K for the vector
baseline was even lower (802), but it still did not come
close to matching the performance of either EDLSI run.

7  Ad  Hoc  Experiments  using  BM25 
and  Power  Normalization 

In  this  section  we  describe  our  submitted  runs  for  BM25 
and  power  normalization. 

7.1  Experimental  Design 

Our  experiments  in  2007  focused  on  comparing  power 
normalization  to  cosine  normalization.  We  discovered 
that power normalization significantly outperformed cosine
normalization, and we also identified the optimal
power  parameter  for  the  2006  and  2007  queries  0. 
Subsequently,  we  ran  some  experiments  that  compared 
power  normalization  to  BM25  0  and  learned  that  BM25 
has  better  retrieval  metrics  (recall  and  precision)  at  top 
ranks,  but  power  normalization  has  better  recall  as  rank 
increases.  Furthermore,  power  normalization  overtakes 
BM25  in  terms  of  recall  at  about  rank  300.  Thus,  for 
2008 we wanted to compare BM25 to power normalization.
UrsinusBM25a and UrsinusPwrA are our baseline
runs  for  BM25  and  power  normalization. 
As noted in Section [4], BM25 requires a large number
of tuning parameters. For our UrsinusBM25a baseline
run we set b = .6, k1 = 2.25, and k3 = 3. These are
the  average  of  the  optimal  values  for  the  TREC  Legal 
2006  and  TREC  Legal  2007  query  sets.  We  used  number 
of  terms  to  represent  document  size  and  the  average 
was  435;  the  total  number  of  indexed  documents  was 
6,827,940. 

In Section [3], we noted that power normalization also
requires a parameter, the normalization factor, p. The
TREC 2006 normalization factor was used for our UrsinusPwrA
submission. The reason for this is described in
Section [7.2].

In addition to comparing BM25 and power normalization,
we wanted to see if automatic query expansion
could be used to improve retrieval performance. Thus we
ran training experiments using the TREC Legal 2007
queries. In these experiments we extracted the most
highly weighted 5 and 10 terms from the top-ranked
document, as well as the top three documents, and added
these terms to the original query before rerunning the
search. For both BM25 and power normalization, we
required the document to have some minimal weight before
choosing terms to add. Multiple document thresholds
were tested, and between 0 and 30 terms were added to
each query.

Results  from  the  automatic  relevance  feedback  tests 
were  submitted  as  runs  UrsinusBM25b  (5  terms  from 
the  top  document,  with  a  minimum  score  of  200), 
UrsinusPwrB  (5  terms  from  the  top  document  with  a 
minimum  score  of  1.0),  and  UrsinusPwrC  (5  terms  from 
the  top  3  documents  with  a  minimum  score  of  1.4). 
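
The expansion procedure can be sketched as below. The function name `expand_query` and the toy weights are hypothetical, and the sketch applies the minimum to the document's retrieval score, which is one reading of "minimum score"; the default values mirror the UrsinusBM25b description (5 terms, top document, minimum score of 200).

```python
def expand_query(query_terms, ranked_docs, doc_term_weights,
                 n_terms=5, n_docs=1, min_score=200.0):
    """Add the highest-weighted terms of the top-ranked documents to a query."""
    expanded = list(query_terms)
    for doc_id, score in ranked_docs[:n_docs]:
        if score < min_score:
            continue                       # document too weak to trust its terms
        weights = doc_term_weights[doc_id]
        top_terms = sorted(weights, key=weights.get, reverse=True)[:n_terms]
        for term in top_terms:
            if term not in expanded:       # do not duplicate existing query terms
                expanded.append(term)
    return expanded


# Toy example: only the first document clears the minimum score.
ranked = [("d1", 250.0), ("d2", 150.0)]
weights = {
    "d1": {"tobacco": 3.0, "smoke": 2.0, "the": 0.1},
    "d2": {"filter": 4.0},
}
new_query = expand_query(["smoke", "ban"], ranked, weights, n_terms=2, n_docs=2)
```

The expanded query is then rerun against the collection in place of the original.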

We  used  only  the  request  text  portion  of  each  query 
for  all  runs. 

7.2  Defining  K  and  Kh  for  Power  Normalization 
and  BM25 

Optimal K and Kh values were identified as described in
Section [5]. For BM25, this resulted in ThresK = 215
and ThresKh = 239. These values were identified by
combining the TREC Legal 2006 and 2007 values into a
single run of the optimization loop. These thresholds were
used for both BM25 runs.

Training using both the 2006 and 2007 query sets
for power normalization to find the optimal ThresK
values for the 2008 queries led to K values that were
very small (we never let K = 0, but there were many
queries with K = 1). Therefore, we decided to use the
ThresK and ThresKh values that were identified by
training on the 2006 queries only, instead of using the
combined values, because this produced more reasonable
results for K. This leads us to conclude that the 2008
queries are more like the 2006 queries than the 2007
queries (or some combination of the two), so we also
used the optimal p for the 2006 query set (p = .36)
rather than an average of the optimal p for 2006 and
2007 (p = .32). The corresponding ThresK = .655
and ThresKh = .800 values were used for the power
normalization runs.

7.3  Results 

Figure  [4]  shows  a  comparison  of  the  BM25  run  with 
relevance  feedback  and  the  BM25  baseline  run.  The 
relevance  feedback  run  outperformed  the  baseline  run 
on  every  statistic  except  estimated  precision  at  K.  This 
was  not  surprising  as  the  average  K  for  the  relevance 
feedback  run  was  almost  three  times  the  average  K 
for  the  baseline  run  (10,168  vs.  3157)  and  precision 
decreases  as  additional  documents  are  retrieved. 

Figure [5] shows the results for the three power normalization
runs. The results for power norm are more
mixed. Although all three runs resulted in similar retrieval
performance across the statistics shown, there is
no run that consistently outperforms the other two. The
baseline run outperforms the relevance feedback runs for
estimated recall at B, estimated precision at F1, recall
at B, and precision at B. PowerC adds the top 5
terms from the top 3 documents if the term weight
is greater than 1.4, and it performs nearly identically to
the baseline run (PowerA) across the board. PowerC
usually scores slightly below PowerA, only slightly
outperforming PowerA when the estimated precision at B is used.
This suggests that PowerC is adding terms that are
slightly unhelpful. PowerB does significantly better than
PowerA on the competition metric (estimated F1 at K).
It outperforms both PowerA and PowerC for estimated
precision at B, estimated recall at K, estimated F1 at K,
mean average precision (MAP), and MAP using judged
documents only. The average K for PowerB is 3,292, for
PowerA it is 905, and for PowerC it is 850. Interestingly,
although PowerB retrieves approximately three times as
many documents at K on average, the precision at K
is only slightly less than precision at K for PowerA and
PowerC. Recall is again naturally much better due to the
larger K value.

Finally,  Figure  [6]  compares  the  performance  of  the 
optimal  BM25  to  the  two  best  Power  normalization 
schemes.  BM25  is  the  clear  winner.  It  soundly  outper¬ 
forms  power  normalization  on  the  track  metric  of  esti¬ 
mated  FI  at  K.  There  is  again  a  three-to-one  advantage  in 
terms  of  documents  retrieved  (BM25b  retrieves  10,168 
on  average,  PowerB  retrieves  3292),  but  BM25b  man¬ 
ages  to  retrieve  this  large  number  of  documents  without 
sacrificing  precision.  Indeed,  BM25b  outperforms  the 
power  normalization  schemes  on  all  three  FI  metrics.  In 
fact, the only algorithm that happens to beat BM25b on
any of the at-K based metrics is EDLSIa, which outperforms
all other runs in estimated recall at K. EDLSIa
appears to accomplish this feat by simply retrieving more
documents (average K is 15,225) and pays dearly in
precision for this recall advantage. PowerA and PowerC
manage to outperform BM25b on the statistic that was
used for the competition in 2007; both of these schemes
have a larger estimated recall at B.

Fig. 4. BM25 Baseline vs. Automated Relevance Feedback

Fig. 5. Power Baseline vs. Automated Relevance Feedback

Fig. 6. Comparison of BM25 to Power Normalization
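
To make the trade-off among precision, recall, and F1 at K
concrete, the following minimal sketch computes the three
values for a ranked list cut off at K. It uses plain set-based
counts rather than the sampling-based estimates reported by
the track, and the function name and the toy document
identifiers are hypothetical.

```python
def f1_at_k(ranked_ids, relevant_ids, k):
    """Return (precision, recall, F1) for the top-k documents of a ranking."""
    retrieved = ranked_ids[:k]
    relevant = set(relevant_ids)
    # Count retrieved documents that are judged relevant.
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (illustrative only, not taken from the experiments).
ranked = ["d1", "d2", "d3", "d4", "d5", "d6"]
relevant = {"d1", "d3", "d6", "d9"}

small_k = f1_at_k(ranked, relevant, 2)
large_k = f1_at_k(ranked, relevant, 6)
```

In this toy example precision is 0.5 at both cutoffs, so the
larger K raises recall and therefore F1. This is the same
pattern as a run that retrieves more documents without
sacrificing precision.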

8 Relevance Feedback Runs Using BM25 and Power Normalization

In  this  section  we  describe  our  submissions  and  results 
for  the  relevance  feedback  task. 

8.1  Experimental  Design 

Our  relevance  feedback  submissions  leveraged  our  work 
on  automatic  relevance  feedback.  We  decided  to  look 
at  the  top  weighted  terms  from  all  documents  that 
were  judged  relevant.  Terms  were  added  to  the  query 
only  if  their  global  term  weight  met  a  certain  weight 
threshold.  Some  minimal  training  was  done  to  determine 
the  optimal  number  of  terms  and  optimal  term  weight 
before  we  ran  out  of  time.  We  used  only  the  request  text 
portion  of  each  query  for  all  runs. 
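
The term selection step described above can be sketched as
follows. This is a minimal illustration, not our actual
implementation: the function names, the dictionary of global
term weights, and the choice of a strict "greater than"
comparison against the threshold are all assumptions made
for the example.

```python
def select_feedback_terms(relevant_docs, global_weight, max_terms=5, min_weight=5.0):
    """Pick up to max_terms terms that occur in judged-relevant documents
    and whose global weight exceeds min_weight (assumed strict comparison)."""
    candidates = set()
    for doc_terms in relevant_docs:
        candidates.update(doc_terms)
    eligible = [t for t in candidates if global_weight.get(t, 0.0) > min_weight]
    # Highest weight first; ties are broken alphabetically for determinism.
    eligible.sort(key=lambda t: (-global_weight[t], t))
    return eligible[:max_terms]


def expand_query(query_terms, feedback_terms):
    """Append feedback terms that are not already in the query."""
    expanded = list(query_terms)
    for term in feedback_terms:
        if term not in expanded:
            expanded.append(term)
    return expanded


# Toy data (hypothetical terms and weights).
relevant_docs = [["merger", "tobacco", "the"], ["tobacco", "nicotine", "memo"]]
weights = {"merger": 6.2, "tobacco": 7.5, "the": 0.1, "nicotine": 8.0, "memo": 5.0}

added = select_feedback_terms(relevant_docs, weights, max_terms=5, min_weight=5.0)
query = expand_query(["tobacco", "advertising"], added)
```

With these toy weights only three terms clear the threshold,
so fewer than the requested five terms are added. The term
cap and the weight threshold act as independent limits.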

Our  relevance  feedback  runs  are  summarized  below: 

•  UCRFBM25BL:  BM25  baseline  using  the  same 
parameters  as  UrsinusBM25a 

•  UCRFPwrBL:  Power  Norm  baseline  using  the  same 
parameters  as  UrsinusPwrA 


•  UCBM25T5Th5:  BM25  with  5  terms  over  weight 
5  added  to  each  query 

•  UCBM25T10Th5:  BM25  with  10  terms  over  weight 
5  added  to  each  query 

•  UCPwrT5Th5:  Power  Norm  with  5  terms  over 
weight  5  added  to  each  query 

•  UCPwrT10Th5:  Power  Norm  with  10  terms  over 
weight 5 added to each query

8.2  Defining  K  and  Kh  for  Relevance  Feedback 

The  ThresK  and  ThresKh  values  from  the  ad  hoc 
experiments  were  also  used  for  the  relevance  feedback 
experiments. 
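
As an illustration only, the sketch below assumes that a
value such as ThresK acts as a fixed cutoff on retrieval
scores, so that K for a query is the number of documents
scoring at or above the cutoff. This section does not define
the thresholds, so that reading, the function name, and the
scores are all hypothetical. Under that assumption, a query
whose scores rise overall yields a larger K for the same
threshold.

```python
def k_from_threshold(scores, threshold):
    """Count the documents whose retrieval score is at least `threshold`."""
    return sum(1 for score in scores if score >= threshold)


# Hypothetical scores for one query before and after query expansion.
baseline_scores = [0.9, 0.6, 0.4, 0.2, 0.1]
expanded_scores = [1.4, 1.0, 0.8, 0.55, 0.3]

threshold = 0.5  # stand-in for a fixed cutoff reused across tasks
k_baseline = k_from_threshold(baseline_scores, threshold)
k_expanded = k_from_threshold(expanded_scores, threshold)
```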

8.3  Results 

The first interesting feature that we noted is that the use
of the ThresK and ThresKh values from the ad hoc
experiments  resulted  in  a  far  larger  value  of  K  for  the 
relevance  feedback  runs.  The  basic  power  normalization 
run  in  the  ad  hoc  submissions  resulted  in  an  average  K 
value  of  905;  for  the  relevance  feedback  runs,  the  average 
K  value  was  1547.  More  strikingly,  after  the  top  5  terms 
were  added,  the  average  K  jumped  to  11,880.  When  ten 
terms  were  added,  the  value  increased  to  13,233.  For  the 
BM25  submission  the  average  K  rose  even  faster.  The 
baseline relevance feedback run had a K value of 564,
which was significantly lower than the ad hoc average
K (3157), but when five terms were added, this jumped
to 6517. When 10 terms were added, it took another leap
to 21,406.

Fig. 7. Power vs. BM25 Relevance Feedback Run Comparison

Fig. 8. Power Relevance Feedback Run Comparison - Adding additional terms

Figure  [7]  shows  a  comparison  of  the  results  for  the 
top  scoring  BM25  and  Power  normalization  relevance 
feedback runs. BM25 clearly outperforms power
normalization. It will be interesting to optimize the
ThresK and ThresKh values in post-hoc testing, to
see if BM25 is getting its advantage by merely retrieving
more documents. The strong showing on last
year's metric (Estimated Recall at B) leads us to believe
that BM25 will likely continue to outperform power
normalization after additional optimizations are run.

Figures [8] and [9] show the comparison for each
weighting/normalization scheme as additional terms are added.
We now detect a distinct difference between the power
normalization runs and the BM25 runs.
The  power  norm  consistently  does  better  as  more  terms 
are  added.  Adding  five  terms  outperforms  the  baseline; 
adding  ten  terms  outperforms  adding  only  five  terms. 
We  plan  post-hoc  testing  to  take  this  strategy  further  to 
determine  where  the  performance  peaks  and  begins  to 
degrade.

Fig. 9. BM25 Relevance Feedback Run Comparison - Adding additional terms

With BM25, however, we have a murkier situation.
The  baseline  run  actually  outperforms  the  relevance 
feedback  runs  when  the  metrics  from  2007  are  used  for 
comparison (Estimated Recall and Precision at B). It
also strongly outperforms the other runs when Recall
at B is measured. The run with ten terms added does
best on the competition metrics for 2008 (F1, Recall,
and  Precision  at  K),  but  this  could  be  due  to  the  larger 
average  K  values.  Perhaps  most  interesting  of  all  is  that 
the  run  with  five  terms  added  does  best  when  the  mean 
average  precision  metrics  are  observed. 

9  Conclusions 

Our experiments show that the Distributed EDLSI approach
is promising, but it is not yet fully mature.
It soundly outperformed the vector baseline approach,
but it could not improve upon a basic BM25 system.
Generally BM25 was the most consistently performing
algorithm across all of the techniques we tested. BM25b
outperformed  power  normalization  on  both  the  ad  hoc 
and  the  relevance  feedback  tasks,  and  it  outperformed 
Distributed  EDLSI  on  the  ad  hoc  task.  Post  analysis  will 
be  done  to  optimize  the  parameter  settings  for  all  three 
approaches. 